Google is pushing the boundaries of artificial intelligence with a groundbreaking new feature for its Gemini 3 Flash model. The company announced Agentic Vision, a capability designed to make visual tasks significantly more accurate and reliable. The system grounds its responses in direct visual evidence rather than prediction, aiming to minimize errors in complex analyses.
How Gemini Agentic Vision Changes AI Perception
Unlike standard AI models that process the world from a single, static viewpoint, Agentic Vision turns perception into an “active investigation.” Previously, if a model missed a fine detail, such as a serial number on a microchip or a distant sign, it was forced to guess. Google’s new approach lets the model do more than just look: it combines visual reasoning with code-execution tools to analyze an image in detail.
Gemini 3 Flash now creates step-by-step plans to best respond to visual prompts, including actions like zooming, inspecting, and processing parts of an image. This process utilizes a “Think, Act, Observe” cycle. First, the model analyzes the user’s request and forms a plan. Next, it uses Python code to perform actions like cropping or rotating the image. Finally, it observes the transformed image to place it in context before generating the final answer.
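Google has not published the internals of this loop, but the “Think, Act, Observe” cycle can be sketched with a toy example. Here the “image” is just a 2D list of pixel values and the crop function stands in for the Python tooling the model invokes; both are illustrative assumptions, not Google’s actual implementation.

```python
# Illustrative "Think, Act, Observe" loop on a toy image represented as a
# 2D list of pixel values (hypothetical stand-in for the model's tooling).

def crop(image, top, left, height, width):
    """Act: extract a sub-region of the image, as a cropping tool might."""
    return [row[left:left + width] for row in image[top:top + height]]

# Think: the plan is "the detail sits in the top-right corner, so crop it out".
image = [
    [0, 0, 7, 7],
    [0, 0, 7, 7],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
detail = crop(image, top=0, left=2, height=2, width=2)

# Observe: the transformed region is examined before the final answer.
print(detail)  # [[7, 7], [7, 7]]
```

The point of the cycle is that each action produces a new, concrete observation the model can reason over, rather than a single pass over the original pixels.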
Beyond Description: A Visual Scratchpad for Accuracy
The model doesn’t just verbally describe an image; it can also draw directly onto the canvas to ground its reasoning process. For instance, to avoid errors when counting fingers on a hand, it can add bounding boxes and numerical labels over each finger. This “visual scratchpad” method ensures pixel-level precision and prevents common counting mistakes. Additionally, the model can automatically zoom in on fine details and analyze dense data tables.
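The scratchpad idea can be made concrete with a small sketch. The finger-detection step is assumed to have already happened; the hypothetical bounding boxes below are `(x, y, w, h)` tuples, and an ASCII grid stands in for drawing labels onto the actual image canvas.

```python
# Minimal "visual scratchpad" sketch: draw a numeric label at the centre of
# each detected box, so counting means reading explicit annotations rather
# than estimating from raw pixels. Boxes are made up for illustration.

def draw_labels(width, height, boxes):
    """Place the label 1..N at the centre of each (x, y, w, h) box."""
    canvas = [["." for _ in range(width)] for _ in range(height)]
    for i, (x, y, w, h) in enumerate(boxes, start=1):
        canvas[y + h // 2][x + w // 2] = str(i)
    return ["".join(row) for row in canvas]

finger_boxes = [(10, 5, 8, 30), (22, 2, 8, 34), (34, 0, 8, 36),
                (46, 2, 8, 34), (58, 8, 8, 26)]
scratchpad = draw_labels(70, 30, finger_boxes)

# The count is now grounded in the annotations themselves.
count = max(int(c) for row in scratchpad for c in row if c.isdigit())
print(count)  # 5
```

Because every counted item carries a visible label, a miscount would show up as a missing or duplicated number on the canvas, which is exactly the kind of error the technique is meant to surface.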
While standard language models often fail at complex visual-mathematical tasks, Gemini 3 Flash overcomes this by offloading calculations to a deterministic Python environment. This replaces probabilistic guesses with verifiable and precise operations. Consequently, Agentic Vision provides a consistent quality improvement of 5% to 10% across most visual benchmarks. This feature is currently available to developers through Google AI Studio and Vertex AI and has started rolling out to the Gemini application. Google plans to further expand the model’s understanding of the world by integrating it with tools like web search and reverse image search in the future.
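The benefit of offloading is easiest to see on table math. In this hedged sketch, the model is assumed to have already read the numbers out of a dense table (the values are invented for illustration); the arithmetic is then performed exactly in Python instead of being estimated in generated text.

```python
# Deterministic offload sketch: values extracted from a (hypothetical) dense
# table are aggregated with exact arithmetic, not probabilistic guessing.

rows = [
    {"region": "north", "q1": 1042, "q2": 987},
    {"region": "south", "q1": 811,  "q2": 1203},
    {"region": "west",  "q1": 560,  "q2": 742},
]

total_q1 = sum(r["q1"] for r in rows)
total_q2 = sum(r["q2"] for r in rows)
growth = (total_q2 - total_q1) / total_q1  # exact, verifiable operation

print(total_q2, round(growth, 3))  # 2932 0.215
```

Every intermediate value here can be checked, which is what replaces a probabilistic guess with a verifiable operation.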
So, what are your thoughts on Gemini’s new Agentic Vision? Share your opinions with us in the comments!